Loading

1. Introduction

First things first: let's look at the number of observations.

## [1] 1599   12

It turns out there are 1599 observations and 12 variables.
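
As a rough sketch of this first step, the snippet below reads a small CSV and prints its dimensions and structure. The loading code itself is not shown in this report, so the object name `redwine_head` and the inline three-row sample (copied from the first values of the structure listing in the next section) are assumptions made only to keep the example self-contained. In practice `read.csv()` would be pointed at the full data file.

```r
# Three-row stand-in for the full file; the real dataset has 1599 rows.
csv_text <- "fixed.acidity,volatile.acidity,citric.acid,residual.sugar,chlorides,free.sulfur.dioxide,total.sulfur.dioxide,density,pH,sulphates,alcohol,quality
7.4,0.70,0.00,1.9,0.076,11,34,0.998,3.51,0.56,9.4,5
7.8,0.88,0.00,2.6,0.098,25,67,0.997,3.20,0.68,9.8,5
7.8,0.76,0.04,2.3,0.092,15,54,0.997,3.26,0.65,9.8,5"

# read.csv(text = ...) parses a string; read.csv("path/to/file.csv") reads a file
redwine_head <- read.csv(text = csv_text)

dim(redwine_head)  # number of rows, then number of columns
str(redwine_head)  # column names and types
```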

2. Univariate Analysis

## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

2.1 Univariate Analysis observations

The plots above provide some detail on the range and distribution of values in the 12 variables that we are dealing with.

The variable we will mainly focus on is ‘quality’. It is recorded as discrete integer ratings from 3 to 8.

There are:

  1. 10 observations rated 3 out of 10
  2. 53 observations rated 4 out of 10
  3. 681 observations rated 5 out of 10
  4. 638 observations rated 6 out of 10
  5. 199 observations rated 7 out of 10
  6. 18 observations rated 8 out of 10
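
A minimal sketch of how such a tally can be produced with `table()` and `prop.table()`. The vector below is rebuilt from the counts listed above rather than read from the data, so it is only a stand-in; on the real data frame the equivalent call would presumably be `table(redwine$quality)`.

```r
# Counts per rating, as listed above
counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# Rebuild a quality vector with one entry per wine
quality <- rep(as.integer(names(counts)), times = counts)

length(quality)                              # total number of observations
table(quality)                               # frequency of each rating
round(100 * prop.table(table(quality)), 1)   # share of each rating, in percent
```

The percentages make the imbalance between the middle ratings and the extreme ratings easy to see.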

Distributions

  1. The variables ‘density’, ‘pH’ and ‘sulphates’ seem to have a unimodal distribution; without zooming in, exploring further or changing bin widths, they look almost symmetric.
  2. The variables ‘free.sulfur.dioxide’ and ‘total.sulfur.dioxide’ appear to be right-skewed.
  3. The other distributions are hard to characterize; converting them to a different scale may help.
  4. All variables except ‘density’ and ‘quality’ appear to have outliers in them.

2.2 Outlier Investigation in Univariate Analysis

For the sake of this project, an ‘outlier’ is defined as any point above (mean + 2 × SD) or below (mean - 2 × SD).

The boxplots, however, are stricter in classifying points as outliers. The outliers shown as red dots in the boxplots are meant only for visualization and should not be taken literally: when it comes to removing outliers, it is up to each individual’s discretion to decide which points fall above the threshold of ‘remove because it is an outlier’.
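
The mean ± 2 SD rule can be sketched as a small helper. The function name `outlier_share` and the toy vector are hypothetical, since the code behind the figures in the following subsections is not shown. Note that the helper returns a proportion between 0 and 1, so it has to be multiplied by 100 to read as a percentage.

```r
# Share of values lying more than k standard deviations from the mean.
# Hypothetical helper: not taken from the original analysis code.
outlier_share <- function(x, k = 2) {
  upper <- mean(x) + k * sd(x)
  lower <- mean(x) - k * sd(x)
  mean(x > upper | x < lower)   # proportion between 0 and 1
}

# Illustrative vector: 99 typical values and one extreme value
x <- c(rep(10, 99), 100)

outlier_share(x)         # proportion of flagged values
100 * outlier_share(x)   # the same figure as a percentage
```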

2.2.1 Outliers in fixed.acidity (Tartaric acid concentration)

## [1] 0.00750469

The ‘fixed.acidity’ variable has about 0.75% outliers (a proportion of 0.00750469).

2.2.2 Outliers in volatile.acidity (Acetic acid concentration)

## [1] 0.006253909

The ‘volatile.acidity’ variable has about 0.63% outliers (a proportion of 0.006253909).

2.2.3 Outliers in citric.acid (Citric acid concentration)

## [1] 0.0006253909

The ‘citric.acid’ variable has about 0.06% outliers (a proportion of 0.0006253909).

2.2.4 Outliers in residual.sugar (Residual sugar concentration)

## [1] 0.01876173

The ‘residual.sugar’ variable has about 1.88% outliers (a proportion of 0.01876173).

2.2.5 Outliers in chlorides (Salt/NaCl concentration)

## [1] 0.01938712

The ‘chlorides’ variable has about 1.94% outliers (a proportion of 0.01938712).

2.2.6 Outliers in free.sulfur.dioxide (Free SO2 concentration)

## [1] 0.0137586

The ‘free.sulfur.dioxide’ variable has about 1.38% outliers (a proportion of 0.0137586).

2.2.7 Outliers in total.sulfur.dioxide (Total SO2 concentration)

## [1] 0.009380863

The ‘total.sulfur.dioxide’ variable has about 0.94% outliers (a proportion of 0.009380863).

2.2.8 Outliers in density

## [1] 0.01125704

The ‘density’ variable has about 1.13% outliers (a proportion of 0.01125704).

2.2.9 Outliers in pH value

## [1] 0.005003127

The ‘pH’ variable has about 0.50% outliers (a proportion of 0.005003127).

2.2.10 Outliers in sulphates

## [1] 0.01688555

The ‘sulphates’ variable has about 1.69% outliers (a proportion of 0.01688555).

2.2.11 Outliers in alcohol

## [1] 0.005003127

The ‘alcohol’ variable has about 0.50% outliers (a proportion of 0.005003127).
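
The printed values in this section are proportions of the 1599 observations (for example, 0.0006253909 × 1599 is 1 wine). As a sketch, they can be converted to percentages and ranked in one step. The vector `props` simply copies the printed outputs above; its name is an assumption for illustration.

```r
# Outlier proportions copied from the printed outputs in this section
props <- c(
  fixed.acidity        = 0.00750469,
  volatile.acidity     = 0.006253909,
  citric.acid          = 0.0006253909,
  residual.sugar       = 0.01876173,
  chlorides            = 0.01938712,
  free.sulfur.dioxide  = 0.0137586,
  total.sulfur.dioxide = 0.009380863,
  density              = 0.01125704,
  pH                   = 0.005003127,
  sulphates            = 0.01688555,
  alcohol              = 0.005003127
)

pct <- round(100 * props, 2)      # proportions expressed as percentages
sort(pct, decreasing = TRUE)      # variables ranked by share of outliers
round(props * 1599)               # implied number of flagged wines per variable
```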

2.3 Univariate Analysis FAQ

1. What is the structure of your dataset?

There are 1599 observations of red wines and 12 variables.

2. What is/are the main feature(s) of interest in your dataset?

‘quality’ is the main feature of the dataset. It is an ordered, categorical variable ranging from 3 to 8, 3 being the worst and 8 being the best.

3. What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

The only way to know this is to perform multivariate analysis and check the correlation coefficients. At this stage it is hard to infer which of the other 11 numerical features affect the main feature. Some may be positively correlated, some negatively, and some may have nothing to do with the main feature whatsoever. Before diving any further, one can surmise that the features that pertain to acidity (fixed.acidity, volatile.acidity and citric.acid) will play some role in the quality of red wine.

4. Did you create any new variables from existing variables in the dataset?

No. Looking at the dataset, creating new features would not add much. The three features that pertain to acidity (fixed.acidity, volatile.acidity and citric.acid) could be combined into a new feature called total.acidity, but each of them plays a very specific role. It would be counterproductive to add a feature that is positively correlated with quality to one that is negatively correlated, since each of the acidity features exists as a separate entity. The other candidates are total.sulfur.dioxide and free.sulfur.dioxide: total SO2 is the sum of bound SO2 and free SO2, so subtracting free from total would give the bound SO2. But there is little reason to create this feature separately when total.sulfur.dioxide is already a superset that adequately represents bound SO2.
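
For reference, the bound SO2 feature discussed above would be a one-line subtraction. This is only a sketch of the idea, not a step taken in this analysis: the column name `bound.sulfur.dioxide` is hypothetical, and the three rows are the first values shown in the structure listing.

```r
# First three rows of the two SO2 columns, as shown in the structure listing
so2 <- data.frame(
  free.sulfur.dioxide  = c(11, 25, 15),
  total.sulfur.dioxide = c(34, 67, 54)
)

# Hypothetical derived column: bound SO2 = total SO2 - free SO2
so2$bound.sulfur.dioxide <- so2$total.sulfur.dioxide - so2$free.sulfur.dioxide
so2
```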

5. Of the features you investigated, were there any unusual distributions?

  1. The variables ‘density’, ‘pH’ and ‘sulphates’ seem to have a unimodal distribution; without zooming in, exploring further or changing bin widths, they look almost symmetric.
  2. The variables ‘free.sulfur.dioxide’ and ‘total.sulfur.dioxide’ appear to be right-skewed.
  3. The other distributions are hard to characterize; converting them to a different scale may help.
  4. All variables except ‘density’ and ‘quality’ appear to have outliers in them.

6. What about outliers?

The top 5 features, from highest to lowest percentage of outliers, are:

  1. chlorides: 1.94%
  2. residual.sugar: 1.88%
  3. sulphates: 1.69%
  4. free.sulfur.dioxide: 1.38%
  5. density: 1.13%

##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000

A self-explanatory summary of all the variables is shown above.

The variable of focus is going to be ‘quality’. This project will endeavour to find the correlations between other variables and our main variable ‘quality’.

3. Bivariate Analysis

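library(GGally)  # provides ggcorr()
library(plyr)    # provides mapvalues()

# ggcorr() needs numeric columns, so encode the ratings 3..8 as codes 1..6
redwine$quality <- as.numeric(factor(redwine$quality,
                                     levels = c("3", "4", "5", "6", "7", "8")))

ggcorr(redwine, label = TRUE, hjust = 0.9, size = 3, color = "grey50", angle = -45)

# Map the codes 1..6 back to the ratings 3..8 and store quality as a factor
redwine$quality <- mapvalues(redwine$quality,
                             from = c("1", "2", "3", "4", "5", "6"),
                             to = c("3", "4", "5", "6", "7", "8"))
redwine$quality <- as.factor(redwine$quality)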
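
A short sketch of what the conversion above does, using the first ten quality values from the structure listing. Calling `as.numeric()` on a factor returns the level positions (1 to 6), not the ratings; going through `as.character()` recovers the ratings themselves. Because the two encodings differ only by a constant shift, the correlation coefficients are the same either way. The variable names here are illustrative assumptions.

```r
# First ten quality and alcohol values from the structure listing
quality <- factor(c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5), levels = 3:8)
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)

codes  <- as.numeric(quality)                 # level positions, 1..6
values <- as.numeric(as.character(quality))   # original ratings, 3..8

codes
values

# A constant shift does not change the correlation coefficient
cor(codes, alcohol)
cor(values, alcohol)
```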

3.1 What variables are correlated and how are they correlated?

To understand which variables are correlated with quality, we will now plot each variable against quality in a boxplot, along with a graph that shows how the mean and median vary.

3.1.1 Percentage of Alcohol concentration vs. quality: Positive correlation

A cleaner positive correlation between ‘quality’ and alcohol content might be observed if the quality levels were more evenly represented; there are far more wines rated ‘5’ than wines rated ‘7’ or ‘8’. Moreover, the wines rated ‘5’ include many outliers: wines whose alcohol concentration is well above the group mean of about 9.9%.

## # A tibble: 6 x 4
##   quality mean_alch median_alch count_alch
##    <fctr>     <dbl>       <dbl>      <int>
## 1       3  9.955000       9.925         10
## 2       4 10.265094      10.000         53
## 3       5  9.899706       9.700        681
## 4       6 10.629519      10.500        638
## 5       7 11.465913      11.500        199
## 6       8 12.094444      12.150         18

The above table shows the mean and median alcohol concentrations for different quality wines. We will now plot the variations in mean and median alcohol concentrations.
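
A per-quality summary like the one above can be sketched in base R with `tapply()`. The tibble output suggests the original used a different toolchain, so treat this as one possible approach rather than the code behind the table. The data frame `wine_head` is a stand-in holding only the first ten rows shown in the structure listing, so its numbers will not match the full table.

```r
# First ten rows of the two relevant columns (from the structure listing)
wine_head <- data.frame(
  quality = factor(c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
)

# Mean, median and count of alcohol within each quality level
by_quality <- data.frame(
  quality     = levels(wine_head$quality),
  mean_alch   = as.numeric(tapply(wine_head$alcohol, wine_head$quality, mean)),
  median_alch = as.numeric(tapply(wine_head$alcohol, wine_head$quality, median)),
  count_alch  = as.integer(tapply(wine_head$alcohol, wine_head$quality, length))
)
by_quality
```

The same pattern applies to the other variables by swapping `alcohol` for the column of interest.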

It may be hard to distinguish wines of quality ‘3’, ‘4’ and ‘5’, since their alcohol concentrations fall in roughly the same range. For wines of quality ‘6’, ‘7’ and ‘8’, however, the alcohol concentration tends to increase with quality.

3.1.2 Citric acid concentration vs. quality: Positive correlation

Among the red wines rated ‘7’, there seem to be many outliers with a low citric acid concentration.

## # A tibble: 6 x 4
##   quality mean_citac median_citac count_citac
##    <fctr>      <dbl>        <dbl>       <int>
## 1       3  0.1710000        0.035          10
## 2       4  0.1741509        0.090          53
## 3       5  0.2436858        0.230         681
## 4       6  0.2738245        0.260         638
## 5       7  0.3751759        0.400         199
## 6       8  0.3911111        0.420          18

Let us plot the variations in mean and median citric acid concentration that the above table shows.

As the wine quality increases, citric acid concentration increases. As noted in the data dictionary, citric acid can add ‘freshness’ and flavor to wines, so a higher concentration of the acid adds more freshness.

3.1.3 Potassium sulphate concentration vs. quality: Positive correlation

Again, among the 681 wines rated ‘5’ there are many outliers. Outliers pull the mean towards them, which makes the mean a non-robust statistic.

## # A tibble: 6 x 4
##   quality mean_sulphates median_sulphates count_sulphates
##    <fctr>          <dbl>            <dbl>           <int>
## 1       3      0.5700000            0.545              10
## 2       4      0.5964151            0.560              53
## 3       5      0.6209692            0.580             681
## 4       6      0.6753292            0.640             638
## 5       7      0.7412563            0.740             199
## 6       8      0.7677778            0.740              18

As the wine quality increases, sulphates concentration increases. Potassium sulphate, the sulphate in question, is an additive that acts as an antimicrobial and antioxidant.

3.1.4 Tartaric acid (fixed.acidity) concentration vs. quality : Somewhat positive correlation

The positive correlation is hard to see in this boxplot.

## # A tibble: 6 x 4
##   quality mean_fixed.acidity median_fixed.acidity count_fixed.acidity
##    <fctr>              <dbl>                <dbl>               <int>
## 1       3           8.360000                 7.50                  10
## 2       4           7.779245                 7.50                  53
## 3       5           8.167254                 7.80                 681
## 4       6           8.347179                 7.90                 638
## 5       7           8.872362                 8.80                 199
## 6       8           8.566667                 8.25                  18

It is hard to infer the quality of a red wine from its tartaric acid concentration, because the mean and median fluctuate from one quality level to the next. For red wines rated ‘7’, the mean and median tartaric acid concentrations are both about 8.8.

3.1.5 Acetic acid (volatile.acidity) concentration vs. quality : Negative correlation

The negative correlation is evident in the boxplot. As the volatile acidity decreases, the quality seems to increase.

## # A tibble: 6 x 4
##   quality mean_volac median_volac count_volac
##    <fctr>      <dbl>        <dbl>       <int>
## 1       3  0.8845000        0.845          10
## 2       4  0.6939623        0.670          53
## 3       5  0.5770411        0.580         681
## 4       6  0.4974843        0.490         638
## 5       7  0.4039196        0.370         199
## 6       8  0.4233333        0.370          18

The mean and median acetic acid concentrations fall steadily as quality increases, levelling off between qualities ‘7’ and ‘8’.

Too much acetic acid gives wine an unpleasant vinegar taste, which is consistent with the high acetic acid levels in the red wines rated ‘3’.

3.1.6 Salt (NaCl/chlorides) concentration vs. quality: Somewhat negative correlation

It is very hard to infer anything from this boxplot. The graph depicting the mean and median variations in salt concentration will help understand this better. All we can infer from the boxplot is the plethora of outliers in salt concentration for redwines marked with qualities ‘5’ and ‘6’.

## # A tibble: 6 x 4
##   quality mean_chlorides median_chlorides count_chlorides
##    <fctr>          <dbl>            <dbl>           <int>
## 1       3     0.12250000           0.0905              10
## 2       4     0.09067925           0.0800              53
## 3       5     0.09273568           0.0810             681
## 4       6     0.08495611           0.0780             638
## 5       7     0.07658794           0.0730             199
## 6       8     0.06844444           0.0705              18

Visualizing the table above.

Good quality wines tend to have less salt in them, as depicted by the graph above, which supports the somewhat negative correlation: as the amount of salt in the wine increases, the quality decreases, and vice versa.

3.1.7 Density vs. quality: Somewhat negative correlation

The density of wine is close to that of water (1.0), depending on the percent alcohol and sugar content. A lower density indicates a higher percentage of alcohol, and a higher percentage of alcohol is associated with higher red wine quality.

## # A tibble: 6 x 4
##   quality mean_density median_density count_density
##    <fctr>        <dbl>          <dbl>         <int>
## 1       3    0.9974640       0.997565            10
## 2       4    0.9965425       0.996500            53
## 3       5    0.9971036       0.997000           681
## 4       6    0.9966151       0.996560           638
## 5       7    0.9961043       0.995770           199
## 6       8    0.9952122       0.994940            18

The density values span a very small range, so it is better to visualize them graphically.

Although wines of almost every quality have densities between 0.995 and 1, small differences in density go along with differences in quality. Good quality wines have densities close to 0.995, while poorer quality wines have densities closer to that of water.

3.1.8 Total SO2 vs. quality: Somewhat negative correlation

## # A tibble: 6 x 4
##   quality mean_tsd median_tsd count_tsd
##    <fctr>    <dbl>      <dbl>     <int>
## 1       3 24.90000       15.0        10
## 2       4 36.24528       26.0        53
## 3       5 56.51395       47.0       681
## 4       6 40.86991       35.0       638
## 5       7 35.02010       27.0       199
## 6       8 33.44444       21.5        18

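The quality of red wine does not vary consistently with the total amount of SO2 in it: the mean and median rise up to quality ‘5’ and fall again afterwards. Red wines of quality ‘4’ and ‘7’ have similar total SO2 levels, with means of about 35 to 36 mg/dm^3 and medians of about 26 to 27 mg/dm^3.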

3.1.9 Residual sugar vs. quality: Close to no correlation

## # A tibble: 6 x 4
##   quality  mean_RS median_RS count_RS
##    <fctr>    <dbl>     <dbl>    <int>
## 1       3 2.635000       2.1       10
## 2       4 2.694340       2.1       53
## 3       5 2.528855       2.2      681
## 4       6 2.477194       2.2      638
## 5       7 2.720603       2.3      199
## 6       8 2.577778       2.1       18

The amount of residual sugar shows no consistent trend across quality levels, which suggests little to no correlation.

3.1.10 pH values vs. quality: Negative correlation

## # A tibble: 6 x 4
##   quality  mean_pH median_pH count_pH
##    <fctr>    <dbl>     <dbl>    <int>
## 1       3 3.398000      3.39       10
## 2       4 3.381509      3.37       53
## 3       5 3.304949      3.30      681
## 4       6 3.318072      3.32      638
## 5       7 3.290754      3.28      199
## 6       8 3.267222      3.23       18

##  Factor w/ 6 levels "3","4","5","6",..: 3 3 3 4 3 3 3 5 5 3 ...
## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18

3.2 Bivariate Analysis FAQ

1. Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

According to the correlation matrix in the multivariate section, some things about the variable ‘quality’ are apparent, and they are confirmed by the graphs in the bivariate section:

  1. Positive correlation with the variables ‘alcohol’, ‘sulphates’ and ‘citric.acid’.
  2. Somewhat positive correlation with the variable ‘fixed.acidity’ (a low positive correlation, indicated by a dim blue color in the scatter matrix).
  3. Negative correlation with the variable ‘volatile.acidity’.
  4. Somewhat negative correlation with the variables ‘chlorides’, ‘density’ and ‘total.sulfur.dioxide’.
  5. Close to no correlation with variables such as ‘residual.sugar’, ‘free.sulfur.dioxide’ and ‘pH’.

2. Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

  1. ‘fixed.acidity’ is negatively correlated with ‘pH’, positively correlated with ‘citric.acid’ and positively correlated with ‘density’.
  2. ‘volatile.acidity’ is negatively correlated with ‘citric.acid’.
  3. ‘citric.acid’ is negatively correlated with ‘pH’.
  4. ‘free.sulfur.dioxide’ and ‘total.sulfur.dioxide’ are positively correlated.
  5. ‘density’ is positively correlated with ‘fixed.acidity’ and negatively correlated with ‘alcohol’.

3. What was the strongest relationship you found?

The variable ‘quality’ has the strongest positive correlation with the variable ‘alcohol’
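
As a quick sanity check on the direction of these relationships, the group means reported in the tables of section 3.1 can be correlated with the quality rating. This is only a sketch: it uses six group means per variable rather than the 1599 individual wines, so the coefficients will not match the correlation matrix, and Spearman's rank correlation is chosen here as an assumption because quality is an ordinal rating.

```r
# Group means per quality level, copied from the tables in section 3.1
quality       <- 3:8
mean_alcohol  <- c(9.955000, 10.265094, 9.899706, 10.629519, 11.465913, 12.094444)
mean_volatile <- c(0.8845000, 0.6939623, 0.5770411, 0.4974843, 0.4039196, 0.4233333)

# Rank correlation between the rating and each group mean
rho_alcohol  <- cor(quality, mean_alcohol,  method = "spearman")
rho_volatile <- cor(quality, mean_volatile, method = "spearman")

rho_alcohol    # positive: alcohol tends to rise with quality
rho_volatile   # negative: volatile acidity tends to fall with quality
```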

4. Multivariate Analysis: Knowing that alcohol content is most strongly correlated with quality of red wine, how does the quality of red wine differ given alcohol content and changing other variables?

We move on to multivariate analysis so that we can identify the correlations between variables. This scatterplot matrix helps show which variables to pay attention to.

Multivariate Analysis observations

According to the matrix, some things about the variable ‘quality’ are apparent:

  1. Positive correlation with the variables ‘alcohol’, ‘sulphates’ and ‘citric.acid’.
  2. Somewhat positive correlation with the variable ‘fixed.acidity’ (a low positive correlation, indicated by a dim blue color in the scatter matrix).
  3. Negative correlation with the variable ‘volatile.acidity’.
  4. Somewhat negative correlation with the variables ‘chlorides’, ‘density’ and ‘total.sulfur.dioxide’.
  5. Close to no correlation with variables such as ‘residual.sugar’, ‘free.sulfur.dioxide’ and ‘pH’.

The main variable in focus (the quality rating out of 10) is stored as an integer, so it is helpful to convert it to a factor.
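
One way to make that conversion, sketched under the assumption that an ordered factor is wanted so that the ratings keep their ranking. The vector is rebuilt from the per-rating counts reported in the univariate section, and the name `quality_f` is illustrative.

```r
# Rebuild the quality column from the per-rating counts reported earlier
quality_int <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

# Ordered factor: levels keep their ranking, so comparisons are meaningful
quality_f <- factor(quality_int, levels = 3:8, ordered = TRUE)

levels(quality_f)        # the six rating levels
table(quality_f)         # counts per level
sum(quality_f >= "7")    # number of wines rated 7 or higher
```

An ordered factor also lets plotting functions treat quality as a discrete grouping variable, which is what the per-quality colors in the following plots rely on.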

4.1 Sulphates vs. % of Alcohol content and quality

Since alcohol and sulphates are both positively correlated with quality, let’s plot them against each other and group them by quality.

The scatter plot includes a regression line for each quality category, colored to match its points, to depict the separation. To make the plot clearer, a geom_smooth plot has been added below it.

It is clear from the above graphs that wines of higher quality, rated ‘6’ and above, tend to have more sulphates in them irrespective of alcohol content.
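
The two kinds of plot used throughout this section could be produced roughly as follows. This is a sketch on synthetic data, not the plotting code behind the figures: the data frame `toy`, the choice of `method = "lm"` and the styling are all assumptions, and the generated values are not red wine measurements.

```r
library(ggplot2)

# Synthetic stand-in data: NOT the red wine measurements
set.seed(1)
toy <- data.frame(
  alcohol = runif(60, min = 9, max = 13),
  quality = factor(rep(c(5, 6, 7), each = 20))
)
toy$sulphates <- 0.5 + 0.05 * as.integer(toy$quality) + rnorm(60, sd = 0.05)

# Scatter plot with one regression line per quality level, matching colors
p_points <- ggplot(toy, aes(x = alcohol, y = sulphates, color = quality)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE)

# Lines only: the same fits without the points, which is easier to read
p_lines <- ggplot(toy, aes(x = alcohol, y = sulphates, color = quality)) +
  geom_smooth(method = "lm", se = FALSE)

# print(p_points); print(p_lines)   # draw the plots in an interactive session
```

Mapping `color = quality` in `aes()` is what gives each quality level its own color and its own fitted line.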

4.2 Density vs. % of Alcohol content and quality

Alcohol is positively correlated with quality, while density is negatively correlated with quality. Let’s plot them against each other and group them by quality.

The scatter plot includes a regression line for each quality category, colored to match its points, to depict the separation. To make the plot clearer, a geom_smooth plot has been added below it.

One main distinction between red wines rated ‘3’ and those rated ‘8’ is the difference in their densities. High quality red wines have densities dropping below 0.995.

4.3 Residual sugar content vs. % of Alcohol content and quality

Residual sugar content has no correlation with quality, while alcohol content has a positive correlation with quality. Let’s plot them against each other and group them by quality.

The scatter plot includes a regression line for each quality category, colored to match its points, to depict the separation. To make the plot clearer, a geom_smooth plot has been added below it.

It is very hard to distinguish quality levels based on alcohol content and residual sugar; there appears to be a lot of overlap.

4.4 pH value vs. % of Alcohol content and quality

Alcohol is positively correlated with quality, while pH is somewhat negatively correlated with quality. Let’s plot them against each other and group them by quality.

The scatter plot includes a regression line for each quality category, colored to match its points, to depict the separation. To make the plot clearer, a geom_smooth plot has been added below it.

Red wines with quality ratings ‘6’, ‘7’ and ‘8’ are very close when it comes to their pH values: the lines for those qualities follow the same path and end close to each other.

ACIDS VS. ALCOHOL CONTENT VS QUALITY (4.5 - 4.7)

4.5 Acetic acid vs. % of Alcohol content and quality

We already know that red wines with high concentrations of acetic acid are of poor quality, since they tend to have a vinegar taste. Let’s draw a graph that plots acetic acid against % of alcohol content, grouped by quality.

The scatter plot includes a regression line for each quality category, colored to match its points, to depict the separation. To make the plot clearer, a geom_smooth plot has been added below it.

One distinction that is easy to make here is that poor quality wines rated ‘3’ have high amounts of acetic acid in them, as shown by their line. Higher quality red wines with ratings ‘6’, ‘7’ and ‘8’ have acetic acid concentrations around the same mark.

4.6 Tartaric acid vs. % of Alcohol content and quality

It is hard to infer the quality of a red wine from its tartaric acid concentration ALONE, because the mean and median fluctuate from one quality level to the next. For red wines rated ‘7’, the mean and median tartaric acid concentrations are both about 8.8. Let us now analyse how the picture changes with % of alcohol content.

The scatter plot includes a regression line for each quality category, colored to match its points, to depict the separation. To make the plot clearer, a geom_smooth plot has been added below it.

Good quality red wines rated ‘6’, ‘7’ and ‘8’ with an alcohol content of 12.5% or more have almost the same concentration of tartaric acid in them.

4.7 Citric acid vs. % of Alcohol content and quality

As the wine quality increases, citric acid concentration increases. As noted in the data dictionary, citric acid can add ‘freshness’ and flavor to wines, so a higher concentration of the acid adds more freshness. Let’s see how it varies with alcohol content.

The scatter plot includes a regression line for each quality category, colored to match its points, to depict the separation. To make the plot clearer, a geom_smooth plot has been added below it.

Among the best quality red wines rated ‘6’, ‘7’ and ‘8’, the citric acid concentration falls as the alcohol content increases. However, when viewing citric acid alone vs. quality, the better quality red wines simply had high citric acid concentrations. Among better wines, the increase in alcohol concentration offsets the decrease in citric acid concentration.

4.8 Salt content vs. % of Alcohol content and quality

The scatter plot includes a regression line for each quality category, colored to match its points, to depict the separation. To make the plot clearer, a geom_smooth plot has been added below it.

The salt content is almost the same in red wines rated ‘7’ and ‘8’. Otherwise, it is hard to come to meaningful conclusions based on salt content alone.

4.9 Multivariate Analysis FAQ

1. Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

  1. Among the best quality red wines rated ‘6’, ‘7’ and ‘8’, the citric acid concentration falls as the alcohol content increases. However, when viewing citric acid alone vs. quality, the better quality red wines simply had high citric acid concentrations. Among better wines, the increase in alcohol concentration offsets the decrease in citric acid concentration.

  2. One distinction that is easy to make is that poor quality wines rated ‘3’ have high amounts of acetic acid in them, as shown by their line. Higher quality red wines with ratings ‘6’, ‘7’ and ‘8’ have acetic acid concentrations corresponding to their respective alcohol content.

  3. It is very hard to distinguish quality levels based on alcohol content and residual sugar; there appears to be a lot of overlap. There appears to be no correlation between residual sugar and the quality of red wine whatsoever.

2. Were there any interesting or surprising interactions between features?

It is clear from the above graphs that wines of higher quality, rated ‘6’ and above, tend to have more sulphates in them irrespective of alcohol content. This is what makes great quality red wines great.

Final Plots

Plot no. 1: What makes terrible quality red wines terrible

Description for Plot no. 1:

With so many correlated variables, the multivariate analysis reveals one thing that completely stands out in poor quality red wines. Poor quality red wines rated ‘3’ on a scale of 10 tend to have higher concentrations of acetic acid. Too much acetic acid in wine can lead to an unpleasant vinegar taste.

Plot no. 2: What makes great quality red wines great

Description for Plot no. 2:

With so many correlated variables, the multivariate analysis reveals one thing that completely stands out in great quality red wines. Great quality red wines rated ‘7’ and ‘8’ on a scale of 10 tend to have higher concentrations of potassium sulphate, irrespective of the alcohol concentration.

Plot no. 3: What doesn’t matter in red wines

Description for Plot no. 3:

The amount of residual sugar shows no consistent trend across quality levels, which suggests little to no correlation between the two variables. The boxplot helps show the variance in each category.

Reflections

Challenges faced

  1. The differences in the number of observations for each class made it difficult to analyse quality. There were no obvious or perfect correlations, probably because of this imbalance. For example, there are 10 observations rated 3 out of 10 but 681 observations rated 5 out of 10. This makes it hard to be sure about the characteristics of poor wines, since there are very few examples of them.

  2. Python offers more aesthetically pleasing plots, graphs and charts through the seaborn package. I did not find an equivalent in R (or maybe I haven’t discovered it yet).

  3. Using geom_smooth instead of geom_point made a ton of difference in articulating useful observations in the Multivariate Analysis section (section 4 and its subsections).

  4. Sometimes multiple legends depicting the same thing popped up in graphs; using the right parameter made a ton of difference.

  5. There were so many ways to visualize data that merely picking the right type of visualization was a challenge.

  6. Although ordered categorical variables need to be displayed using sequential color palettes, this makes it hard to see the differences between categories, since there is so much overlap between lightly shaded and darker shaded lines of the same color.

Future works

  1. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (excellent). It would help to know how each wine expert rated the wines, since that would tell us more about the experts themselves. One expert might give higher ratings to wines with more alcohol content, while another might rate a red wine with a higher acetic acid concentration more highly.

  2. It would also help to know which varieties of grapes were used and in what proportion, since grapes are the core ingredient of red wine.
